In this quiz, I had to decide which statistical model fit each question. The models showcased are simple linear regression, multiple linear regression (with and without interaction terms), and one-way ANOVA.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(pander)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## 
## Attaching package: 'GGally'
## 
## The following object is masked from 'package:pander':
## 
##     wrap
library(broom)
coag <- read.csv("coagulation.csv", header = TRUE)
infmort <- read.csv("infmort.csv", header = TRUE)
pima <- read.csv("pima.csv", header = TRUE)
prostate <- read.csv("prostate.csv", header = TRUE)

#Models: lm, multiple linear regression, one-way ANOVA
#Workflow: hypotheses, EDA, hypothesis testing, assumption checks (independence, linearity, constant variance, normality), fit tests, visualization

Question 1: 1. Impossible zeros recoded as NAs; see below. 2. Linear regression. Null hypothesis: there is no statistically significant association between family history (diabetes) and diabetes test status (test). Alternative hypothesis: there is a statistically significant association between family history (diabetes) and diabetes test status (test). Significance will be set to 0.05, the standard alpha level.

Results: model p-value = 1.255e-06, so there is a statistically significant association between family history (diabetes) and test status (test); reject the null. Adjusted R-squared = 0.029, so the model explains little of the variance. Both coefficients are significant. Median residual = -0.0967; residual standard error = 0.327; F-statistic = 23.87.

Conclusion: There is a link between family history and currently having diabetes. 3. I kept test in the regression model so its share of the explained variance is accounted for. Per the coefficient table below, the only variable significantly associated with diabetes is test (diastolic is marginal at p = 0.054). 4. I likewise kept diabetes in the model for test. Variables associated with test: diabetes, glucose, bmi, and age. This makes sense: if you have diabetes, closely linked variables like glucose track with it.

#1 Recode biologically impossible zeros (columns 2-6) as NA
pima[, 2:6][pima[, 2:6] == 0] <- NA
head(pima)
#2
#Run the linear regression first so the assumption checks below make sense
LinearReg <- lm(diabetes ~ factor(test), data = pima)
summary(LinearReg)
## 
## Call:
## lm(formula = diabetes ~ factor(test), data = pima)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.46250 -0.22556 -0.09673  0.15227  1.89927 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.42973    0.01460  29.431  < 2e-16 ***
## factor(test)1  0.12077    0.02472   4.886 1.25e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3265 on 766 degrees of freedom
## Multiple R-squared:  0.03022,    Adjusted R-squared:  0.02896 
## F-statistic: 23.87 on 1 and 766 DF,  p-value: 1.255e-06
#Dummy variable: test is stored as an integer but is categorical, hence the factor transformation
pimaDum <- mutate(pima, test.dum = factor(test))
LinearReg_Dum <- lm(diabetes~test.dum, pimaDum) #final linear regression; defined here so the assumption checks below can use it
summary(LinearReg_Dum)
## 
## Call:
## lm(formula = diabetes ~ test.dum, data = pimaDum)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.46250 -0.22556 -0.09673  0.15227  1.89927 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.42973    0.01460  29.431  < 2e-16 ***
## test.dum1    0.12077    0.02472   4.886 1.25e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3265 on 766 degrees of freedom
## Multiple R-squared:  0.03022,    Adjusted R-squared:  0.02896 
## F-statistic: 23.87 on 1 and 766 DF,  p-value: 1.255e-06
#EDA
#Preliminary EDA
    #Summaries
    dim(pima)
## [1] 768   9
    str(pima)
## 'data.frame':    768 obs. of  9 variables:
##  $ pregnant : int  6 1 8 1 0 5 3 10 2 8 ...
##  $ glucose  : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ diastolic: int  72 66 64 66 40 74 50 NA 70 96 ...
##  $ triceps  : int  35 29 NA 23 35 NA 32 NA 45 NA ...
##  $ insulin  : int  NA NA NA 94 168 NA 88 NA 543 NA ...
##  $ bmi      : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 NA ...
##  $ diabetes : num  0.627 0.351 0.672 0.167 2.288 ...
##  $ age      : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ test     : int  1 0 1 0 1 0 1 0 1 1 ...
    summary(pima)
##     pregnant         glucose        diastolic         triceps     
##  Min.   : 0.000   Min.   : 44.0   Min.   : 24.00   Min.   : 7.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 64.00   1st Qu.:22.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :29.00  
##  Mean   : 3.845   Mean   :121.7   Mean   : 72.41   Mean   :29.15  
##  3rd Qu.: 6.000   3rd Qu.:141.0   3rd Qu.: 80.00   3rd Qu.:36.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##                   NA's   :5       NA's   :35       NA's   :227    
##     insulin            bmi           diabetes           age       
##  Min.   : 14.00   Min.   :18.20   Min.   :0.0780   Min.   :21.00  
##  1st Qu.: 76.25   1st Qu.:27.50   1st Qu.:0.2437   1st Qu.:24.00  
##  Median :125.00   Median :32.30   Median :0.3725   Median :29.00  
##  Mean   :155.55   Mean   :32.46   Mean   :0.4719   Mean   :33.24  
##  3rd Qu.:190.00   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00  
##  Max.   :846.00   Max.   :67.10   Max.   :2.4200   Max.   :81.00  
##  NA's   :374      NA's   :11                                      
##       test      
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000  
## 
    pima %>%
      group_by(factor(test)) %>%
      summarise(n = n(), mean = mean(diabetes), sd = sd(diabetes), 
      median = median(diabetes), IQR = IQR(diabetes)) %>% pander
| factor(test) | n   | mean   | sd     | median | IQR    |
|--------------|-----|--------|--------|--------|--------|
| 0            | 500 | 0.4297 | 0.2991 | 0.336  | 0.332  |
| 1            | 268 | 0.5505 | 0.3724 | 0.449  | 0.4655 |
    #Plots
    ggplot(pima, aes(test))+
      geom_bar()

    ggplot(pima, aes(diabetes))+
        geom_histogram()+
        geom_vline(aes(xintercept = mean(diabetes)), color = "red")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

    ggplot(pima, aes(diabetes))+
      geom_boxplot() #a lot of outliers

    ggplot(pima, aes(x= factor(test), y= diabetes))+
        geom_boxplot()

  #Assumptions
    #1.Independent observations = One participant's family history and diabetes status do not depend on another participant's.
    #2.Linearity = Not linear
      par(mfrow=c(2,2))
      plot(LinearReg) #Residuals vs Fitted is trivially linear: the binary predictor gives only two fitted values

      crPlots(LinearReg) #The component+residual plot does not look linear
    #3 Homoscedasticity/ constant variability = Not homoscedastic.
      par(mfrow=c(2,2))

      plot(LinearReg) #See plot(LinearReg)'s Scale-Location plot: the line is not horizontal and runs only between two fitted values.

    #4 Normality of Residuals = Residuals are not normal. Need to log(diabetes).
      #See plot(LinearReg)'s Normal Q-Q plot OR the following: 
      pima.data <- augment(LinearReg_Dum)# get the residuals
      ggplot(pima.data, aes(sample = .std.resid)) +
        geom_qq() +
        stat_qq_line(color = "green") #Doesn't follow a straight line; not normal at all

     ggplot(pima.data, aes(x = .resid)) +
       geom_histogram(bins = 35) #Right-skewed, echoing the Q-Q plot; hence the log transformation below

     #Log function
      LinearReg1 <- lm(log(diabetes)~ test.dum, data = pimaDum)
      summary(LinearReg1)
## 
## Call:
## lm(formula = log(diabetes) ~ test.dum, data = pimaDum)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.62933 -0.47994 -0.03252  0.47709  1.89052 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.04508    0.02836 -36.851  < 2e-16 ***
## test.dum1    0.24399    0.04801   5.082 4.69e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6341 on 766 degrees of freedom
## Multiple R-squared:  0.03262,    Adjusted R-squared:  0.03136 
## F-statistic: 25.83 on 1 and 766 DF,  p-value: 4.689e-07
      crPlots(LinearReg1)
      par(mfrow=c(2,2))

      plot(LinearReg1)#A lot more normal now

      (length(boxplot(pima$diabetes, plot=F)$out)/768)*100 #I could remove outliers here, but I've decided against it since there are so many of them (3.8% of data are outliers). That seems like a lot to cut out.
## [1] 3.776042
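      #For reference, a sketch of the removal I decided against: drop rows
      #whose diabetes value falls outside the boxplot whiskers (out.vals and
      #pima.trim are illustrative names, not used elsewhere).
      out.vals <- boxplot(pima$diabetes, plot = FALSE)$out
      pima.trim <- filter(pima, !(diabetes %in% out.vals))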
#3 Variables associated with diabetes: only test is significant (diastolic is marginal)
ExploreDiabetes <- lm(log(diabetes) ~ test.dum + pregnant + glucose + diastolic + triceps + insulin + bmi + age, pimaDum)
summary(ExploreDiabetes) %>% pander
| &nbsp;      | Estimate   | Std. Error | t value | Pr(>\|t\|) |
|-------------|------------|------------|---------|------------|
| (Intercept) | -0.9295    | 0.2424     | -3.834  | 0.0001473  |
| test.dum1   | 0.2436     | 0.0805     | 3.025   | 0.00265    |
| pregnant    | -0.01207   | 0.01332    | -0.9064 | 0.3653     |
| glucose     | -0.0003842 | 0.001398   | -0.2748 | 0.7836     |
| diastolic   | -0.005291  | 0.00274    | -1.931  | 0.05419    |
| triceps     | 0.001903   | 0.004006   | 0.4751  | 0.635      |
| insulin     | 0.0001546  | 0.0003251  | 0.4755  | 0.6347     |
| bmi         | 0.006663   | 0.00624    | 1.068   | 0.2864     |
| age         | 0.005521   | 0.004451   | 1.24    | 0.2156     |

Fitting linear model: log(diabetes) ~ test.dum + pregnant + glucose + diastolic + triceps + insulin + bmi + age

| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|--------------|---------------------|---------|------------------|
| 392          | 0.6127              | 0.06126 | 0.04165          |
#4 Variables associated with test: diabetes, glucose, bmi, & age
ExploreTest <- lm(test ~ diabetes + pregnant + glucose + diastolic + triceps + insulin + bmi + age, pima)
summary(ExploreTest) %>% pander
| &nbsp;      | Estimate   | Std. Error | t value | Pr(>\|t\|) |
|-------------|------------|------------|---------|------------|
| (Intercept) | -1.103     | 0.1436     | -7.681  | 1.337e-13  |
| diabetes    | 0.1572     | 0.05804    | 2.708   | 0.007066   |
| pregnant    | 0.01295    | 0.008364   | 1.549   | 0.1223     |
| glucose     | 0.006409   | 0.0008159  | 7.855   | 4.071e-14  |
| diastolic   | 5.465e-05  | 0.00173    | 0.03158 | 0.9748     |
| triceps     | 0.001678   | 0.002522   | 0.6652  | 0.5063     |
| insulin     | -0.0001233 | 0.0002045  | -0.6031 | 0.5468     |
| bmi         | 0.009325   | 0.003901   | 2.391   | 0.0173     |
| age         | 0.005878   | 0.002787   | 2.109   | 0.03559    |

Fitting linear model: test ~ diabetes + pregnant + glucose + diastolic + triceps + insulin + bmi + age

| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|--------------|---------------------|---------|------------------|
| 392          | 0.3853              | 0.3458  | 0.3321           |

Question 2: 1. A small sample is a problem for ANOVA: the normality assumption is harder to satisfy than the others, there aren't enough points to establish normality, and outliers can strongly distort such a small dataset. 2. ANOVA. Null hypothesis: there is no difference in mean coagulation time between diet types. Alternative hypothesis: at least one diet type's mean coagulation time differs from the others. Significance will be set to 0.05, the standard alpha level.

ANOVA results: There is a statistically significant difference between means; reject the null. p-value = 4.658e-05, F = 13.57. Bonferroni post-test: D is significantly different from B & C, and A is significantly different from B & C.

Conclusion: Coagulation time differs by diet. The diets that differ are D from B & C, and A from B & C.

#1
#EDA
#Preliminary EDA
    dim(coag)
## [1] 24  2
    str(coag)
## 'data.frame':    24 obs. of  2 variables:
##  $ coag: int  62 60 63 59 63 67 71 64 65 66 ...
##  $ diet: chr  "A" "A" "A" "A" ...
    summary(coag)
##       coag           diet          
##  Min.   :56.00   Length:24         
##  1st Qu.:61.75   Class :character  
##  Median :63.50   Mode  :character  
##  Mean   :64.00                     
##  3rd Qu.:67.00                     
##  Max.   :71.00
    summary(coag$coag)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   56.00   61.75   63.50   64.00   67.00   71.00
    summary(factor(coag$diet))
## A B C D 
## 4 6 6 8
    coag %>%
      group_by(diet) %>%
      summarise(n = n(), mean = mean(coag), sd = sd(coag), 
      median = median(coag), IQR = IQR(coag)) %>% pander
| diet | n | mean | sd    | median | IQR  |
|------|---|------|-------|--------|------|
| A    | 4 | 61   | 1.826 | 61     | 2.5  |
| B    | 6 | 66   | 2.828 | 65.5   | 2.5  |
| C    | 6 | 68   | 1.673 | 68     | 0.75 |
| D    | 8 | 61   | 2.619 | 61.5   | 3.25 |
     #Plots
    ggplot(coag, aes(diet, fill= diet))+
      geom_bar()

    ggplot(coag, aes(coag))+
        geom_boxplot()+
        geom_vline(aes(xintercept = mean(coag)), color = "green")

    ggplot(coag, aes(x=coag, fill = diet)) +
      geom_boxplot()+
      facet_grid(diet ~ .)+
    theme(legend.position = "none")

  #Assumptions
    #1.Independent observations = Assumed in the experiment's design
    #2.Homogeneity = There is equal variance
      leveneTest(coag$coag, factor(coag$diet)) #Equal variance
    #3 Normality (especially important because the sample size is small) = Not skewed, but platykurtic. Not enough samples to really form an ideal normal distribution.
      shapiro.test(coag$coag) #normal: the p-value is not significant
## 
##  Shapiro-Wilk normality test
## 
## data:  coag$coag
## W = 0.97759, p-value = 0.8476
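      #As an additional sketch, normality can be checked within each diet
      #group too, since the ANOVA assumption concerns the groups/residuals
      #rather than the pooled response:
      by(coag$coag, factor(coag$diet), shapiro.test)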
      qplot(data = coag, sample = coag, color = diet) #the per-group points do not line up along a straight line
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

      ggplot(coag, aes(coag))+
      geom_histogram(binwidth=.9)# Not enough points to be normal.

    #4 No extreme outliers
      ggplot(coag, aes(coag))+
      geom_boxplot()+
        facet_wrap("diet")#No extreme outliers. Can't remove data with such a small sample size anyways.

#2
#Run ANOVA test:
coag.aov <- aov(coag ~ diet, data = coag) #named coag.aov to avoid masking stats::anova
summary(coag.aov) %>% pander
Analysis of Variance Model

| &nbsp;    | Df | Sum Sq | Mean Sq | F value | Pr(>F)    |
|-----------|----|--------|---------|---------|-----------|
| diet      | 3  | 228    | 76      | 13.57   | 4.658e-05 |
| Residuals | 20 | 112    | 5.6     | NA      | NA        |
#POST-TEST
bonferroni <- pairwise.t.test(coag$coag, coag$diet, p.adjust.method = "bonferroni") #Bonferroni over Tukey because it's more conservative
bonferroni #D is significantly different from B & C; A is significantly different from B & C.
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  coag$coag and coag$diet 
## 
##   A       B       C      
## B 0.02282 -       -      
## C 0.00108 0.95266 -      
## D 1.00000 0.00518 0.00014
## 
## P value adjustment method: bonferroni
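#For comparison, a sketch of the Tukey HSD post-test mentioned above,
#run directly on the same aov fit:
TukeyHSD(coag.aov)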

Question 3: 1. N/A; noted. 2. Multiple regression. Null hypothesis: there is no relationship between lpsa and the rest of the variables. Alternative hypothesis: there is a relationship between lpsa and the rest of the variables. Significance will be set to 0.05, the standard alpha level.

Results: model p-value < 2.2e-16, F-statistic = 32.94, so there is a statistically significant association between lpsa and lcavol, lweight, svi, age, and lbph collectively; reject the null. Adjusted R-squared = 0.6245, so the model explains 62.45% of the variability. Significant coefficients: lcavol, lweight, svi. Residual standard error = 0.7073.

Conclusion: The overall model is significant, so there is a significant relationship between lpsa and the variables we narrowed down as most influential; lcavol, lweight, and svi drive most of it, while age and lbph are not individually significant.

#2 
#Run the multiple regression first so the assumption checks below make sense
multiReg <- lm(lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason + pgg45, data = prostate)
summary(multiReg)
## 
## Call:
## lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi + lcp + 
##     gleason + pgg45, data = prostate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7331 -0.3713 -0.0170  0.4141  1.6381 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.669337   1.296387   0.516  0.60693    
## lcavol       0.587022   0.087920   6.677 2.11e-09 ***
## lweight      0.454467   0.170012   2.673  0.00896 ** 
## age         -0.019637   0.011173  -1.758  0.08229 .  
## lbph         0.107054   0.058449   1.832  0.07040 .  
## svi          0.766157   0.244309   3.136  0.00233 ** 
## lcp         -0.105474   0.091013  -1.159  0.24964    
## gleason      0.045142   0.157465   0.287  0.77503    
## pgg45        0.004525   0.004421   1.024  0.30886    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7084 on 88 degrees of freedom
## Multiple R-squared:  0.6548, Adjusted R-squared:  0.6234 
## F-statistic: 20.86 on 8 and 88 DF,  p-value: < 2.2e-16
#EDA
  #Preliminary EDA
  dim(prostate)
## [1] 97  9
  str(prostate)
## 'data.frame':    97 obs. of  9 variables:
##  $ lcavol : num  -0.58 -0.994 -0.511 -1.204 0.751 ...
##  $ lweight: num  2.77 3.32 2.69 3.28 3.43 ...
##  $ age    : int  50 58 74 58 62 50 64 58 47 63 ...
##  $ lbph   : num  -1.39 -1.39 -1.39 -1.39 -1.39 ...
##  $ svi    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ lcp    : num  -1.39 -1.39 -1.39 -1.39 -1.39 ...
##  $ gleason: int  6 6 7 6 6 6 6 6 6 6 ...
##  $ pgg45  : int  0 0 20 0 0 0 0 0 0 0 ...
##  $ lpsa   : num  -0.431 -0.163 -0.163 -0.163 0.372 ...
  summary(prostate)
##      lcavol           lweight           age             lbph        
##  Min.   :-1.3471   Min.   :2.375   Min.   :41.00   Min.   :-1.3863  
##  1st Qu.: 0.5128   1st Qu.:3.376   1st Qu.:60.00   1st Qu.:-1.3863  
##  Median : 1.4469   Median :3.623   Median :65.00   Median : 0.3001  
##  Mean   : 1.3500   Mean   :3.653   Mean   :63.87   Mean   : 0.1004  
##  3rd Qu.: 2.1270   3rd Qu.:3.878   3rd Qu.:68.00   3rd Qu.: 1.5581  
##  Max.   : 3.8210   Max.   :6.108   Max.   :79.00   Max.   : 2.3263  
##       svi              lcp             gleason          pgg45       
##  Min.   :0.0000   Min.   :-1.3863   Min.   :6.000   Min.   :  0.00  
##  1st Qu.:0.0000   1st Qu.:-1.3863   1st Qu.:6.000   1st Qu.:  0.00  
##  Median :0.0000   Median :-0.7985   Median :7.000   Median : 15.00  
##  Mean   :0.2165   Mean   :-0.1794   Mean   :6.753   Mean   : 24.38  
##  3rd Qu.:0.0000   3rd Qu.: 1.1786   3rd Qu.:7.000   3rd Qu.: 40.00  
##  Max.   :1.0000   Max.   : 2.9042   Max.   :9.000   Max.   :100.00  
##       lpsa        
##  Min.   :-0.4308  
##  1st Qu.: 1.7317  
##  Median : 2.5915  
##  Mean   : 2.4784  
##  3rd Qu.: 3.0564  
##  Max.   : 5.5829
    #Plots
    ggplot(prostate, aes(lpsa))+
        geom_histogram()+
        geom_vline(aes(xintercept = mean(lpsa)), color = "blue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  #Assumptions
    #1.Independent observations = Assumed in design of experiment. Durbin-Watson could be applied.
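      #A sketch of that Durbin-Watson check, using durbinWatsonTest() from
      #the car package loaded above:
      durbinWatsonTest(multiReg)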
    #2.Linearity = Not linear
      par(mfrow=c(2,2))
      plot(multiReg) #Residual vs Fitted plot is not linear

      crPlots(multiReg) # Most of the variables' residuals are not linear

      #Maybe transforming lcavol will help. (Even though I see the log transformation has been applied to the relevant variables.)
      prostate1 <- prostate %>% mutate(lcavol_c = scale(lcavol, scale = FALSE))
      multiReg1 <- lm(lpsa ~ lcavol_c+ I(lcavol_c^2) + lweight + age + lbph + svi + lcp + gleason + pgg45, data = prostate1)
      summary(multiReg1)
## 
## Call:
## lm(formula = lpsa ~ lcavol_c + I(lcavol_c^2) + lweight + age + 
##     lbph + svi + lcp + gleason + pgg45, data = prostate1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.73429 -0.37817 -0.01007  0.37917  1.63319 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.429977   1.338502   1.068  0.28832    
## lcavol_c       0.594688   0.094361   6.302 1.17e-08 ***
## I(lcavol_c^2)  0.011900   0.051249   0.232  0.81693    
## lweight        0.448784   0.172677   2.599  0.01098 *  
## age           -0.019295   0.011330  -1.703  0.09213 .  
## lbph           0.110603   0.060721   1.821  0.07197 .  
## svi            0.754191   0.250981   3.005  0.00347 ** 
## lcp           -0.109584   0.093203  -1.176  0.24290    
## gleason        0.047276   0.158584   0.298  0.76633    
## pgg45          0.004585   0.004453   1.030  0.30596    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7123 on 87 degrees of freedom
## Multiple R-squared:  0.655,  Adjusted R-squared:  0.6193 
## F-statistic: 18.35 on 9 and 87 DF,  p-value: < 2.2e-16
      crPlots(multiReg1)#It barely changed, so I will not be using this transformation going forward, but I wanted to check.

    #3 Homoscedasticity/ constant variability = Not homoscedastic.
      par(mfrow=c(2,2))
      plot(multiReg) #See the Scale-Location plot: the line is not horizontal enough to be homoscedastic. Since a log transformation has already been applied, it won't get much straighter than this.

    #4 Normality of Residuals = Residuals are not the most normal. It could be worse, but it is not great.
      #See plot(multiReg)'s Normal Q-Q plot OR the following: 
      pros.data <- augment(multiReg)# get the residuals
      ggplot(pros.data, aes(sample = .resid)) +
        geom_qq() +
        stat_qq_line(color = "green") #Doesn't follow a straight line; not normal, but close

     ggplot(pros.data, aes(x = .resid)) +
       geom_histogram(bins = 35) #Normal enough, maybe a little platykurtic

     (length(boxplot(prostate$lpsa, plot=F)$out)/97)*100 #I could remove outliers here, but I've decided against it since there are so many of them (4.1% of data are outliers). That is too much to cut out
## [1] 4.123711
    #5 No multicollinearity = Multicollinearity is present in a number of variables.
      ggpairs(data=prostate)

      ggcorr(prostate) #gleason & pgg45 are highly correlated, which makes sense: pgg45 is based on Gleason scores. lpsa & lcavol, lcp & svi, lcp & lcavol, lcp & pgg45, lpsa & svi, and lcp & gleason are all moderately correlated, so multicollinearity is present.
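      #As a numeric follow-up to the correlation plots, variance inflation
      #factors can be computed for the full model (a sketch; VIFs above
      #roughly 5-10 are the usual rule-of-thumb concern):
      vif(multiReg)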

#Which model fits best
  #by significance = lcavol, lweight, svi, with age and lbph marginal (p ≈ 0.07-0.08)
  #by backward BIC = lcavol, lweight, svi
    multiReg_back <- lm(lpsa ~ .,prostate)
    n <- nrow(prostate)
    back.bic <- step(multiReg_back, direction = "backward",  k= log(n), trace = 0)
    back.bic
## 
## Call:
## lm(formula = lpsa ~ lcavol + lweight + svi, data = prostate)
## 
## Coefficients:
## (Intercept)       lcavol      lweight          svi  
##     -0.2681       0.5516       0.5085       0.6662
  #by backward AIC = lcavol, lweight, svi, age, lbph
    multiReg_back <- lm(lpsa ~ .,prostate)
    back.aic <- step(multiReg_back, direction = "backward", trace = 0)
    back.aic
## 
## Call:
## lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi, data = prostate)
## 
## Coefficients:
## (Intercept)       lcavol      lweight          age         lbph          svi  
##     0.95100      0.56561      0.42369     -0.01489      0.11184      0.72095
#Best model showdown!
multiReg_best1 <- lm(lpsa ~ lcavol + lweight + svi + age + lbph, prostate)
summary(multiReg_best1) #Adjusted R^2 = 0.6245; the winner
## 
## Call:
## lm(formula = lpsa ~ lcavol + lweight + svi + age + lbph, data = prostate)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.83505 -0.39396  0.00414  0.46336  1.57888 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.95100    0.83175   1.143 0.255882    
## lcavol       0.56561    0.07459   7.583 2.77e-11 ***
## lweight      0.42369    0.16687   2.539 0.012814 *  
## svi          0.72095    0.20902   3.449 0.000854 ***
## age         -0.01489    0.01075  -1.385 0.169528    
## lbph         0.11184    0.05805   1.927 0.057160 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7073 on 91 degrees of freedom
## Multiple R-squared:  0.6441, Adjusted R-squared:  0.6245 
## F-statistic: 32.94 on 5 and 91 DF,  p-value: < 2.2e-16
multiReg_best2 <- lm(lpsa ~ lcavol + lweight + svi, prostate)
summary(multiReg_best2) #Adjusted R^2 = 0.6144
## 
## Call:
## lm(formula = lpsa ~ lcavol + lweight + svi, data = prostate)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.72964 -0.45764  0.02812  0.46403  1.57013 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.26809    0.54350  -0.493  0.62298    
## lcavol       0.55164    0.07467   7.388  6.3e-11 ***
## lweight      0.50854    0.15017   3.386  0.00104 ** 
## svi          0.66616    0.20978   3.176  0.00203 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7168 on 93 degrees of freedom
## Multiple R-squared:  0.6264, Adjusted R-squared:  0.6144 
## F-statistic: 51.99 on 3 and 93 DF,  p-value: < 2.2e-16
#The adjusted R^2 for multiReg_best1 is higher, so its predictors explain more of the variance than multiReg_best2's. Also, the researchers want a predictive model more than an explanatory one, so keeping age and lbph is the better choice.
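#Since multiReg_best2 is nested in multiReg_best1, the two can also be
#compared with a partial F-test (a sketch; a small p-value would favor
#keeping age and lbph):
anova(multiReg_best2, multiReg_best1)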

Question 4: 1. Noted. 2. Regression with an interaction term. Null hypothesis: region has no effect on infant mortality rates once income per capita is accounted for. Alternative hypothesis: region has an effect on infant mortality rates even with income per capita accounted for. Significance will be set to 0.10.

Results: model p-value = 5.055e-05, F-statistic = 5.194, so region affects infant mortality rates even with income per capita accounted for; reject the null. Adjusted R-squared = 0.227, so the model explains 22.7% of the variability. Significant coefficients: the Africa baseline (intercept) and the Americas and Europe region terms, plus the Americas-by-income interaction at the 0.10 level. Residual standard error = 79.83.

Conclusion: Region affects infant mortality rate. With Africa as the reference level, the Americas and Europe show lower mortality than Africa. For the Americas this is especially true once income is factored in: the Americas-by-income interaction is negative, so mortality falls with income faster there than in Africa.

Comparing the two models, we find the interaction model's adjusted R-squared (0.227) is essentially the same as the region-only model's (0.233), so income and its interactions add little explanatory power; a partial F-test after the interaction model's summary below makes the comparison formal.

#1
#Run the baseline region-only regression first so the assumption checks below make sense
Reg <- lm(mortality ~ region, data = infmort)
summary(Reg)
## 
## Call:
## lm(formula = mortality ~ region, data = infmort)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -87.29 -37.52  -5.76  16.91 553.83 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      142.29      13.64  10.431  < 2e-16 ***
## regionAmericas   -87.17      21.76  -4.005 0.000122 ***
## regionAsia       -46.12      20.50  -2.249 0.026755 *  
## regionEurope    -123.04      23.19  -5.306 7.07e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 79.54 on 97 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.2556, Adjusted R-squared:  0.2326 
## F-statistic:  11.1 on 3 and 97 DF,  p-value: 2.494e-06
#EDA
  #Preliminary EDA
  dim(infmort)
## [1] 105   4
  str(infmort)
## 'data.frame':    105 obs. of  4 variables:
##  $ region   : chr  "Asia" "Europe" "Europe" "Americas" ...
##  $ income   : int  3426 3350 3346 4751 5029 3312 3403 5040 2009 2298 ...
##  $ mortality: num  26.7 23.7 17 16.8 13.5 10.1 12.9 20.4 17.8 25.7 ...
##  $ oil      : chr  "no oil exports" "no oil exports" "no oil exports" "no oil exports" ...
  sum(is.na(infmort))#4 NAs in mortality
## [1] 4
  infmort <- na.omit(infmort) 
  summary(infmort)
##     region              income       mortality          oil           
##  Length:101         Min.   :  50   Min.   :  9.60   Length:101        
##  Class :character   1st Qu.: 130   1st Qu.: 26.20   Class :character  
##  Mode  :character   Median : 334   Median : 60.60   Mode  :character  
##                     Mean   :1022   Mean   : 89.05                     
##                     3rd Qu.:1191   3rd Qu.:129.40                     
##                     Max.   :5596   Max.   :650.00
  infmort %>%
    group_by(region) %>%
    summarise(mean = mean(mortality)) %>% pander
| region   | mean  |
|----------|-------|
| Africa   | 142.3 |
| Americas | 55.12 |
| Asia     | 96.17 |
| Europe   | 19.26 |
  levels(factor(infmort$region))
## [1] "Africa"   "Americas" "Asia"     "Europe"
  #Plots
  ggplot(infmort, aes(mortality, fill= region))+
        geom_boxplot()+
      facet_grid(region ~ .)+
    theme(legend.position = "none")

  ggplot(infmort, aes(x = income, y = mortality, color = region))+
    geom_point(alpha = 0.7)

  #Assumptions
    #1.Independence of residuals = Assumed in the design of the experiment.
    #2.Linearity = Not linear.
      par(mfrow=c(2,2))
      plot(Reg)

      crPlots(Reg) #The Residuals vs Fitted plot does not look straight.
    #3 Homoscedasticity/ constant variability of residuals = Not homoscedastic.
      par(mfrow=c(2,2))

      plot(Reg)#Scale-Location plot is not straight.

    #4 Normality of Residuals = Residuals are not normal. Skewed to the right & kurtotic.
      #See plot(Reg)'s Normal Q-Q plot OR the following: 
      infmort.data <- augment(Reg)# get the residuals
      ggplot(infmort.data, aes(sample = .resid)) +
        geom_qq() +
        stat_qq_line(color = "green") #The outliers are way off the line

     ggplot(infmort.data, aes(x = .resid)) +
       geom_histogram(bins = 25) #This shows it isn't normal: right-skewed & kurtotic

     (length(boxplot(infmort$mortality, plot=F)$out)/101)*100 #I could remove outliers here, but I've decided against it since 3% of the data are outliers. That seems like a lot to cut out.
## [1] 2.970297
#Dummy variable  
infmortDum <- mutate(infmort, region.dum = factor(region))
Reg_Dum <- lm(mortality~region.dum, infmortDum)
summary(Reg_Dum)
## 
## Call:
## lm(formula = mortality ~ region.dum, data = infmortDum)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -87.29 -37.52  -5.76  16.91 553.83 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          142.29      13.64  10.431  < 2e-16 ***
## region.dumAmericas   -87.17      21.76  -4.005 0.000122 ***
## region.dumAsia       -46.12      20.50  -2.249 0.026755 *  
## region.dumEurope    -123.04      23.19  -5.306 7.07e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 79.54 on 97 degrees of freedom
## Multiple R-squared:  0.2556, Adjusted R-squared:  0.2326 
## F-statistic:  11.1 on 3 and 97 DF,  p-value: 2.494e-06
#2 Answer
interaction <- lm(mortality ~ region.dum * income, data = infmortDum)
summary(interaction)
## 
## Call:
## lm(formula = mortality ~ region.dum * income, data = infmortDum)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -101.04  -31.36   -2.58   16.25  559.77 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               130.91725   15.55534   8.416 4.55e-13 ***
## region.dumAmericas        -66.30647   26.15577  -2.535   0.0129 *  
## region.dumAsia            -30.28530   24.14769  -1.254   0.2129    
## region.dumEurope          -96.26275   47.63292  -2.021   0.0462 *  
## income                      0.04163    0.02702   1.541   0.1268    
## region.dumAmericas:income  -0.05133    0.02982  -1.721   0.0886 .  
## region.dumAsia:income      -0.04842    0.03121  -1.552   0.1242    
## region.dumEurope:income    -0.04669    0.03018  -1.547   0.1253    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 79.83 on 93 degrees of freedom
## Multiple R-squared:  0.2811, Adjusted R-squared:  0.227 
## F-statistic: 5.194 on 7 and 93 DF,  p-value: 5.055e-05
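#As referenced above, a partial F-test comparing the region-only model to
#the interaction model (a sketch; Reg_Dum is nested within interaction and
#both are fit on infmortDum):
anova(Reg_Dum, interaction)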
#Tried log to make it more normal, but it didn't do much
interaction.log <- lm(log(mortality) ~ region.dum * income, data = infmortDum)
infmort.data2 <- augment(interaction.log)
ggplot(infmort.data2, aes(sample = .std.resid)) +
  geom_qq() +
  stat_qq_line(color = "green")